January 30, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
When there is no way to analytically solve for the empirical risk minimizer, we need computational optimization methods to find the values of any unknown coefficients.
The most common approach when we can evaluate the gradient of the loss function at any candidate set of values is gradient descent:
\[\bb_{t + 1} = \bb_{t} - \eta_t \mathbf g(\bb_t)\]
\(\mathbf g(\bb_t)\) is the gradient of the loss function evaluated at the current value
\(\eta_t\) is the step size at iteration \(t\)
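The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer; the quadratic loss and starting point are made up for the example:

```python
import numpy as np

def gradient_descent(grad, beta0, eta=0.1, n_iter=100):
    """Fixed-step gradient descent: beta_{t+1} = beta_t - eta * g(beta_t)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        beta = beta - eta * grad(beta)
    return beta

# Toy example: L(beta) = ||beta||^2 has gradient 2 * beta and minimizer 0.
beta_hat = gradient_descent(lambda b: 2 * b, beta0=[4.0, -3.0])
```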
For most problems, the loss is expressed in the form:
\[\expn \mathcal L(y_i , \hat{y_i})\]
Meaning that:
\[\mathbf g(\bb) = \expn \mathbf g_i(\bb)\]
As \(N\) and/or \(P\) gets large, this does not scale particularly well!
Instead, we can take advantage of the LLN and approximate the gradient at each step using a random sample of the observations (a mini-batch)
Even with a batch size of 1, given a suitable learning rate schedule, we'll land at the minimum
The benefit is a much less computationally intense optimization routine!
The cost is the noise.
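A minimal sketch of mini-batch SGD on a made-up linear regression problem (the data, batch size, and step size below are all illustrative choices, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X w + noise.
N, P = 1000, 5
X = rng.normal(size=(N, P))
w_true = rng.normal(size=P)
y = X @ w_true + 0.1 * rng.normal(size=N)

def sgd(X, y, batch_size=32, eta=0.05, n_iter=2000):
    """Each step estimates the full gradient from a random mini-batch."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        g = (2 / batch_size) * Xb.T @ (Xb @ beta - yb)  # noisy gradient estimate
        beta -= eta * g
    return beta

beta_hat = sgd(X, y)  # lands near w_true, with some noise from the sampling
```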
To demonstrate:
1000 observations of a basic binary classifier
50 features
Only 2 really matter! The rest are just noise.
Both methods get where they need to go!
In both cases, though, the move from start to minimum is quite inefficient
In the gradient descent case, it takes the full 10K iterations to get there, never actually reaching the stopping criterion
SGD is similar.
This means that the optimization path is making inefficient moves searching solely for the direction of steepest descent
Very sensitive to step size!
Let’s start by examining regular ol’ gradient descent
A Method that Reduces Some Sensitivity and Improves Efficiency: Learning Rate Schedules
Rather than using a fixed step size, start with a large step size and temper the learning rate as the algorithm proceeds.
Most common scheduler: Polynomial Decay
\(\eta_t = \eta_0(b t + 1)^{-a}\)
This schedule starts with large steps then quickly decays to smaller steps.
This makes sense:
Our initial guess is probably pretty bad. Move away from it quickly, hopefully in a descent direction.
Once we move onto a path that is decent, start moving a little slower.
At the end, we’re hopefully around the minimum and we’ll tiptoe towards the actual minimum
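The polynomial decay schedule above is a one-liner; the constants here are arbitrary choices for illustration:

```python
def polynomial_decay(eta0, a, b):
    """Return a function computing eta_t = eta0 * (b * t + 1)^(-a)."""
    return lambda t: eta0 * (b * t + 1) ** (-a)

eta = polynomial_decay(eta0=1.0, a=0.5, b=0.1)
# Big early steps that decay: eta(0) = 1.0, eta(990) = 0.1.
```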
Static learning rate schedules aren't a cure-all!
Gradient descent moves quickly down the hill
But, then it hits a flat minimum
Moves very slowly.
Even with a static learning rate schedule:
The moves through the space are slow
They require not starting with too large a step size
They're jumpy if the choice of decay function isn't just right
A first improvement to gradient descent leverages an alternative proof that it works - multivariate Taylor Series approximations.
Let \(\mathcal L(\bb_t)\) be the loss evaluated at the coefficients at time \(t\). We can use a Taylor series expansion to approximate the surface of the loss function in the neighborhood of the evaluation point.
A first order approximation:
\[\mathcal L(\bb) = \mathcal L(\bb_t) + \mathbf g(\bb_t)^T (\bb - \bb_t) + \mathcal O\left(\|\bb - \bb_t\|^2\right)\]
A second order approximation:
\[\mathcal L(\bb) = \mathcal L(\bb_t) + \mathbf g(\bb_t)^T (\bb - \bb_t) + \frac{1}{2} (\bb - \bb_t)^T \mathbf H(\bb_t) (\bb - \bb_t) + \mathcal O\left(\|\bb - \bb_t\|^3\right)\]
If our goal is to find the smallest value of the loss, then it suffices to use the Taylor series approximation to minimize the loss
For the first order approximation:
\[\bb_{t + 1} = \bb_{t} - \eta \mathbf g(\bb_t)\]
For the second order approximation:
\[\bb_{t + 1} = \bb_{t} - \eta \mathbf H(\bb_t)^{-1} \mathbf g(\bb_t)\]
Why might the second order approximation work better?
Pre-conditioning on the inverse Hessian deskews the loss space
Recall that the gradient for logistic regression is:
\[\mathbf g(\bb_t) = \expn \mathbf x_i(\sigma(\mathbf x_i^T \bb_t) - y_i)\]
And the Hessian:
\[\mathbf H(\bb_t) = \expn \mathbf x_i \mathbf x_i^T\sigma(\mathbf x_i^T \bb_t)(1 - \sigma(\mathbf x_i^T \bb_t))\]
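These two formulas drop straight into the second order update; below is a plain-NumPy sketch on made-up data (the data, coefficients, and iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def newton_logistic(X, y, n_iter=25):
    """Newton's method for logistic regression using the average gradient and Hessian."""
    N, P = X.shape
    beta = np.zeros(P)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        g = X.T @ (p - y) / N                # g(beta) from the formula above
        H = (X.T * (p * (1 - p))) @ X / N    # H(beta) from the formula above
        beta -= np.linalg.solve(H, g)        # solve H d = g rather than invert H
    return beta

# Synthetic binary classification data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (rng.random(500) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
beta_hat = newton_logistic(X, y)
```

Solving the linear system instead of explicitly inverting the Hessian is the standard trick, but it is still \(\mathcal O(P^3)\)-ish per iteration.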
The first order gradient descent method only makes moves that are orthogonal to the level sets
The second order method is able to ignore the level sets and make moves that go in other directions!
Make big moves at the start to find a direction aimed at the minimum itself - not just orthogonal to the level sets
Once we get that point, immediately dive to the minimum
This is called Newton's Method, and it is the top performer when the loss space is strictly convex
But, takes way more time per iteration.
Any thoughts as to why?
The big problem - matrix inversion!
The most efficient matrix inversion methods scale as \(\mathcal O(P^3)\)-ish.
The time needed grows cubically in the number of features!
Won’t scale well to high-dimensional data
We’re quickly going to grow out of the realm where it is viable to compute the Hessian
Can we improve the first order methods?
We previously saw one big speed problem
The first order method works well at the beginning when the loss function is decreasing rapidly
Slows to a halt in flat regions
The gradient is small, so the change from \(t\) to \(t + 1\) is also small.
Maybe we can speed it up in the flat regions to add some speed?
A heuristic method that has good properties:
Let
\[\mathbf m_{t + 1} = b \mathbf m_t + \mathbf g(\bb_t)\]
\[\bb_{t + 1} = \bb_t - \eta_t \mathbf m_{t + 1}\]
where \(0 \le b \le 1\) is a constant.
This is called momentum
In each iteration, we are moving along the same direction
Even though we know that we're probably going to keep moving in the same direction at the next step, plain gradient descent only moves \(\eta\, \mathbf g(\bb_t)\) each time!
\[\mathbf m_{t + 1} = b \mathbf m_t + \mathbf g(\bb_t)\]
At time \(t + 1\):
\[\mathbf m_{t + 1} = b \mathbf m_{t} + \mathbf g(\bb_t)\]
\[\mathbf m_{t + 1} = b^2 \mathbf m_{t - 1} + b \mathbf g(\bb_{t - 1}) + \mathbf g(\bb_t)\]
\[\mathbf m_{t + 1} = b^3 \mathbf m_{t - 2} + b^2 \mathbf g(\bb_{t - 2}) + b \mathbf g(\bb_{t - 1}) + \mathbf g(\bb_t)\]
\[\mathbf m_{t + 1} = \sum \limits_{h = 0}^t b^h \mathbf g(\bb_{t - h})\]
assuming \(\mathbf m_0 = \mathbf 0\)
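A minimal sketch of the momentum updates (toy quadratic loss; the step size and momentum constant are illustrative):

```python
import numpy as np

def momentum_descent(grad, beta0, eta=0.1, b=0.9, n_iter=500):
    """Heavy-ball momentum: m accumulates a decaying sum of past gradients."""
    beta = np.asarray(beta0, dtype=float)
    m = np.zeros_like(beta)
    for _ in range(n_iter):
        m = b * m + grad(beta)   # m_{t+1} = b m_t + g(beta_t)
        beta = beta - eta * m    # step along the accumulated direction
    return beta

# Toy example: L(beta) = ||beta||^2, gradient 2 * beta, minimizer 0.
beta_hat = momentum_descent(lambda v: 2 * v, [4.0, -3.0])
```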
How does this speed things up in the flat minimum case?
The gradient in this space remains constant
\[\mathbf m_{t + 1} = \mathbf g(\bb_{t}) \sum \limits_{h = 0}^t b^h\]
Assuming that this sequence continues indefinitely:
\[\sum \limits_{h = 0}^\infty b^h = \frac{1}{1 - b}\]
Letting \(b = .9\), as we stay on the same gradient for a while, this gets closer and closer to 10
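A quick numeric check of this limit with \(b = .9\):

```python
# Partial sums of the geometric series approach 1 / (1 - b) = 10.
b = 0.9
partial = sum(b ** h for h in range(100))
```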
This increase in speed isn’t uniform, though!
\[\mathbf g(\bb_t)^T = [-1,3] \text{ ; } \mathbf g(\bb_{t - 1})^T = [1,3]\]
\[\mathbf m_{t + 1} = \mathbf g(\bb_t) + b \mathbf g(\bb_{t - 1}) + \sum \limits_{h = 2}^{t} b^h \mathbf g(\bb_{t - h})\]
with \(b\) sufficiently close to one:
\[\mathbf g(\bb_t) + b \mathbf g(\bb_{t - 1}) \approx [0,6]^T\]
The sharp change in the gradient was shrunk back towards zero
The direction that had previously been seen continues to increase in step size
Think of it like a heavy ball or a rolling stone
If the hill is always in the same direction, keep rolling faster!
If the hill has lots of sharp changes, hit the wall and stop!
It’s really fast!
Cons:
If the loss function is really irregular, then it'll just move incredibly slowly for all eternity
Requires storing two \(P\)-vectors: \(\mathbf m\) and \(\mathbf g(\bb_t)\). Modern machines with plenty of RAM won’t have a problem with this, though!
Will second order methods and/or momentum work with smaller batch sizes?
Thoughts?
Do you think that the average Hessian converges as quickly in \(N\) compared to the gradient?
With batch size 1, do you expect that the gradient will ever really build much momentum?
Problems unique to SGD:
Variance around the true minimum
High sensitivity to learning rate
The Hessian isn’t going to converge that quickly due to the small batch sizes and high dimensionality
One solution is to create a learning rate schedule that decays over time.
Problem:
It’s a preset schedule that doesn’t adapt to the specifics of the loss function
Another thought is to use the Hessian
Maybe the idea of multiplying by the inverse Hessian is useful, though?
\[-\mathbf H(\hat{\boldsymbol \Theta})^{-1}\]
In statistical theory, the negative Hessian is called the observed Fisher information
Its inverse is associated with the variance of the parameter estimates at a particular point in the parameter space
We can think of the inverse Hessian as a variance-covariance matrix
From a calculus perspective:
The diagonal of the inverse Hessian tells us how sharp the loss surface is with respect to a specific variable
A small diagonal element (small variance) means that a small change to \(\hat{\beta}_j\) will result in a large change in the loss
A large diagonal element (large variance) means that a small change to \(\hat{\beta}_j\) will result in a small change in the loss
The off diagonals tell us the covariance between the two parameters
A thought:
If we want to choose the move of size \(\eta\) that will most decrease the loss in the parameter space, we want to set the descent direction mostly to parameters that have small diagonals in the inverse Hessian!
This is how second order methods race through the loss space
Make big changes on parameters that have small diagonal inverse Hessian elements
Make small changes on parameters that have big diagonal inverse Hessian elements
If we precondition the gradient on the Hessian, then we move through the space in an optimal way!
But, the Hessian is prohibitively expensive to calculate/hard to approximate using SGD
The BHHH approximation to the Hessian
\[\mathbf H(\bb_t) \approx \mathbf g(\bb_t) \mathbf g(\bb_t)^T\]
Often called the Outer Product of the Gradient Approximation
Arises as a consequence of MLEs and Fisher information
\[\mathbf H(\bb_t) \approx \mathbf g(\bb_t) \mathbf g(\bb_t)^T\]
This approximation only reaches equality at a point in the loss space where the gradient is equal to zero.
But, it is sort of close elsewhere.
The key thing here is that we already have the gradient evaluated at a point for gradient descent procedures!
Even more clever:
Let’s just ignore the off-diagonals of the Hessian and concentrate on the diagonal
With the BHHH approximation:
\[\tilde{\mathbf H}(\bb_t) = \begin{bmatrix} g_1(\bb_t)^2 & 0 & 0 & \ldots \\ 0 & g_2(\bb_t)^2 & 0 & \ldots \\ 0 & 0 & g_3(\bb_t)^2 & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]
where \(g_j(\bb_t)\) is the \(j\)-th element of the gradient
Approximating the second order search:
\[ \bb_{t + 1} = \bb_t - \eta \tilde{\mathbf H}(\bb_t)^{-1} \mathbf g(\bb_t)\]
With a few more adjustments to handle numerical instability and to take advantage of SGD, we get the AdaGrad algorithm:
\[ \bb_{t + 1} = \bb_t - \eta \frac{\mathbf g(\bb_t)}{\sqrt{\mathbf s_{t} + \epsilon}}\]
where
\[\mathbf s_t = \sum \limits_{h = 1}^t \text{diag}\left(\tilde{\mathbf H}(\bb_h)\right) = \sum \limits_{h = 1}^t \mathbf g(\bb_h)^2\]
and \(\epsilon\) is a small positive value to prevent division by zero.
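A minimal sketch of these AdaGrad equations (toy quadratic loss; the learning rate and iteration count are illustrative):

```python
import numpy as np

def adagrad(grad, beta0, eta=1.0, eps=1e-8, n_iter=500):
    """AdaGrad: per-coordinate steps scaled by accumulated squared gradients."""
    beta = np.asarray(beta0, dtype=float)
    s = np.zeros_like(beta)
    for _ in range(n_iter):
        g = grad(beta)
        s += g ** 2                          # running sum of squared gradients
        beta -= eta * g / np.sqrt(s + eps)   # adaptive per-coordinate step
    return beta

# Toy example: L(beta) = ||beta||^2, gradient 2 * beta.
beta_hat = adagrad(lambda v: 2 * v, [4.0, -3.0])
```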
AdaGrad has an adaptive learning rate for the gradient
We set the initial learning rate
Have lower effective learning rates for parameters associated with high curvature (large accumulated squared gradients), and higher ones for parameters with small gradients
Diminish the learning rate over time - the running sum of the squared gradient values
Think of this as a smarter learning rate decay schedule that doesn’t require choosing a decay rate
Sort of like momentum for the “Hessian”
Note: AdaGrad can use a stopping criterion
Not always a good idea!
It’s quite quick!
Still a little inefficient.
Spends most of its time tiptoeing to the minimum
This is partially because the running sum of squared gradient values only grows, so the effective step size shrinks toward zero too quickly!
An improvement: RMSProp
\[ \bb_{t + 1} = \bb_t - \eta \frac{\mathbf g(\bb_t)}{\sqrt{\mathbf s_{t} + \epsilon}}\]
where
\[\mathbf s_t = a \mathbf s_{t - 1} + (1 - a) \mathbf g(\bb_t)^2\]
where \(0 \le a \le 1\).
Instead of a raw running sum, implement a decaying moving average of the approximate diagonal of the Hessian
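The only change from AdaGrad is the update for \(\mathbf s\); a sketch with the same toy loss (settings are illustrative):

```python
import numpy as np

def rmsprop(grad, beta0, eta=0.01, a=0.9, eps=1e-8, n_iter=5000):
    """RMSProp: decaying moving average of squared gradients instead of a raw sum."""
    beta = np.asarray(beta0, dtype=float)
    s = np.zeros_like(beta)
    for _ in range(n_iter):
        g = grad(beta)
        s = a * s + (1 - a) * g ** 2         # exponentially decaying average
        beta -= eta * g / np.sqrt(s + eps)
    return beta

beta_hat = rmsprop(lambda v: 2 * v, [4.0, -3.0])  # toy quadratic loss
```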
Why not combine this approach of adaptive gradients with momentum?
The Adaptive Moment Estimation method (ADAM):
\[ \bb_{t + 1} = \bb_t - \eta \frac{\mathbf m_t}{\sqrt{\mathbf s_{t} + \epsilon}}\]
\[\mathbf m_t = b_1 \mathbf m_{t - 1} + (1 - b_1) \mathbf g(\bb_t)\]
\[\mathbf s_t = b_2 \mathbf s_{t - 1} + (1 - b_2) \mathbf g(\bb_t)^2\]
Let the gradient move like a heavy ball
Also let the “Hessian” move like a heavy ball
Get the benefits of momentum on both elements of the optimization problem!
ADAM implementation:
\(\eta\) is commonly set to .001 or .01. May need to be tuned if you’re having convergence problems.
\(b_1 = .9\)
\(b_2 = .999\)
No real reason other than these values seem to perform well in most settings.
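A sketch of the update as written on this slide; note that it omits the bias-correction terms the published ADAM algorithm adds, and the toy loss is illustrative:

```python
import numpy as np

def adam(grad, beta0, eta=0.001, b1=0.9, b2=0.999, eps=1e-8, n_iter=20000):
    """ADAM: momentum on both the gradient and the squared gradient."""
    beta = np.asarray(beta0, dtype=float)
    m = np.zeros_like(beta)
    s = np.zeros_like(beta)
    for _ in range(n_iter):
        g = grad(beta)
        m = b1 * m + (1 - b1) * g        # first moment (heavy-ball gradient)
        s = b2 * s + (1 - b2) * g ** 2   # second moment (heavy-ball "Hessian")
        beta -= eta * m / (np.sqrt(s) + eps)
    return beta

beta_hat = adam(lambda v: 2 * v, [4.0, -3.0])  # toy quadratic loss
```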
AdaGrad, RMSProp, and ADAM:
Do not require much tuning of the initial learning rate
Converge into the minimum with relatively low noise
Converge into the minimum faster than standard SGD
Different problems will benefit from different approaches!
ADAM is sort of the default optimizer
The other two are a little less computationally intense (they only carry \(\mathbf s\), not both \(\mathbf m\) and \(\mathbf s\))
Work well for simple problems. ADAM works much better in complicated high dimensional problems.
Why do we care so much about these different optimization methods?
These methods are the building blocks of different deep learning methods!
As long as we can compute the gradient evaluated at the unknowns, we can minimize our loss function
As we’ll soon see, neural networks have a lot of parameters. Many of them will be ill-conditioned and not yield clean minima.
The quality of a deep learning classifier lives and dies by our ability to effectively minimize the loss function
SGD is not optimal for all problems!
Regular ol’ linear methods should still use more standard optimizers
We’ve finished the set of foundational topics that I believe we need to cover to better understand way more complicated linear model structures
Generalization Theory
Likelihood and chained equations
Regularization
Minimization with no analytical solution
Topics that are not covered as in depth in the intro class!
Next class, we’ll start discussing universal approximators and feed forward neural networks
The feature extractor set up
Activation functions and similarities to splines
Multi-unit single layer networks as universal approximators
The goal of this class is to really dig into how and why neural networks and deep learning work